audio source




Unveiling Audio Deepfake Origins: A Deep Metric learning And Conformer Network Approach With Ensemble Fusion

Kulkarni, Ajinkya, Dowerah, Sandipana, Alumae, Tanel, Magimai-Doss, Mathew

arXiv.org Artificial Intelligence

Audio deepfakes are acquiring an unprecedented level of realism with advanced AI. While current research focuses on discerning real speech from spoofed speech, tracing the source system is equally crucial. This work proposes a novel audio source tracing system combining deep metric multi-class N-pair loss with Real Emphasis and Fake Dispersion framework, a Conformer classification network, and ensemble score-embedding fusion. The N-pair loss improves discriminative ability, while Real Emphasis and Fake Dispersion enhance robustness by focusing on differentiating real and fake speech patterns. The Conformer network captures both global and local dependencies in the audio signal, crucial for source tracing. The proposed ensemble score-embedding fusion shows an optimal trade-off between in-domain and out-of-domain source tracing scenarios. We evaluate our method using Frechet Distance and standard metrics, demonstrating superior performance in source tracing over the baseline system.
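The multi-class N-pair loss mentioned in the abstract pulls an anchor embedding toward a positive of the same class (here, the same source system) while pushing it away from negatives of other classes. A minimal NumPy sketch of that loss, not the authors' implementation (the function and variable names are ours):

```python
import numpy as np

def n_pair_loss(anchor, positive, negatives):
    """Multi-class N-pair loss for a single anchor.

    anchor:    (d,) embedding
    positive:  (d,) embedding from the same class (same source system)
    negatives: (n, d) embeddings from other classes
    """
    pos_sim = anchor @ positive        # similarity to the positive
    neg_sims = negatives @ anchor      # (n,) similarities to negatives
    # log(1 + sum_k exp(neg_k - pos)) == negative log-softmax of the positive
    return np.log1p(np.sum(np.exp(neg_sims - pos_sim)))
```

The loss shrinks as the anchor-positive similarity grows relative to the anchor-negative similarities, which is what gives the embedding space its discriminative structure for source tracing.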


Listen, Chat, and Edit: Text-Guided Soundscape Modification for Enhanced Auditory Experience

Jiang, Xilin, Han, Cong, Li, Yinghao Aaron, Mesgarani, Nima

arXiv.org Artificial Intelligence

In daily life, we encounter a variety of sounds, both desirable and undesirable, with limited control over their presence and volume. Our work introduces "Listen, Chat, and Edit" (LCE), a novel multimodal sound mixture editor that modifies each sound source in a mixture based on user-provided text instructions. LCE distinguishes itself with a user-friendly chat interface and its unique ability to edit multiple sound sources simultaneously within a mixture, without needing to separate them. Users input open-vocabulary text prompts, which are interpreted by a large language model to create a semantic filter for editing the sound mixture. The system then decomposes the mixture into its components, applies the semantic filter, and reassembles it into the desired output. We developed a 160-hour dataset with over 100k mixtures, including speech and various audio sources, along with text prompts for diverse editing tasks like extraction, removal, and volume control. Our experiments demonstrate significant improvements in signal quality across all editing tasks and robust performance in zero-shot scenarios with varying numbers and types of sound sources.
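The final stage of the pipeline described above, applying a semantic filter to separated components and reassembling them, can be sketched as a per-source gain applied at re-mix time. This is an illustrative simplification of LCE, not its actual code; the function name and gain convention are assumptions:

```python
import numpy as np

def apply_semantic_filter(sources, gains):
    """Re-mix separated sources with per-source gains.

    sources: (n_sources, n_samples) array of separated waveforms
    gains:   (n_sources,) array; 0.0 removes a source, 1.0 keeps it,
             and other values implement volume control
    """
    return (gains[:, None] * sources).sum(axis=0)
```

Extraction then corresponds to a gain vector with a single 1.0, removal to zeroing one entry, and volume control to fractional gains, with all edits applied in a single re-mix.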


Language-Guided Audio-Visual Source Separation via Trimodal Consistency

Tan, Reuben, Ray, Arijit, Burns, Andrea, Plummer, Bryan A., Salamon, Justin, Nieto, Oriol, Russell, Bryan, Saenko, Kate

arXiv.org Artificial Intelligence

We propose a self-supervised approach for learning to perform audio source separation in videos based on natural language queries, using only unlabeled video and audio pairs as training data. A key challenge in this task is learning to associate the linguistic description of a sound-emitting object to its visual features and the corresponding components of the audio waveform, all without access to annotations during training. To overcome this challenge, we adapt off-the-shelf vision-language foundation models to provide pseudo-target supervision via two novel loss functions and encourage a stronger alignment between the audio, visual and natural language modalities. During inference, our approach can separate sounds given text, video and audio input, or given text and audio input alone. We demonstrate the effectiveness of our self-supervised approach on three audio-visual separation datasets, including MUSIC, SOLOS and AudioSet, where we outperform state-of-the-art strongly supervised approaches despite not using object detectors or text labels during training.


Diff2Lip: Audio Conditioned Diffusion Models for Lip-Synchronization

Mukhopadhyay, Soumik, Suri, Saksham, Gadde, Ravi Teja, Shrivastava, Abhinav

arXiv.org Artificial Intelligence

The task of lip synchronization (lip-sync) seeks to match the lips of human faces with different audio. It has various applications in the film industry, as well as for creating virtual avatars and for video conferencing. This is a challenging problem, as one needs to introduce detailed, realistic lip movements while simultaneously preserving the identity, pose, emotions, and image quality. Many previous methods attempting to solve this problem suffer from image-quality degradation due to a lack of complete contextual information. In this paper, we present Diff2Lip, an audio-conditioned diffusion-based model that performs lip synchronization in-the-wild while preserving these qualities. We train our model on Voxceleb2, a video dataset containing in-the-wild talking face videos. Extensive studies show that our method outperforms popular methods like Wav2Lip and PC-AVS on the Fréchet inception distance (FID) metric and in user Mean Opinion Scores (MOS). We show results in both the reconstruction (same audio-video inputs) and cross (different audio-video inputs) settings on the Voxceleb2 and LRW datasets. Video results and code can be accessed from our project page ( https://soumik-kanad.github.io/diff2lip ).
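The Fréchet distance that underlies the FID metric mentioned above compares two Gaussians fitted to feature statistics. The full metric requires a matrix square root over full covariances; the sketch below is a simplified diagonal-covariance version (our own illustration, not the paper's evaluation code), where the matrix square root reduces to an elementwise square root:

```python
import numpy as np

def frechet_distance_diag(mu1, var1, mu2, var2):
    """Fréchet distance between two Gaussians with diagonal covariances.

    General form: ||mu1 - mu2||^2 + Tr(C1 + C2 - 2 (C1 C2)^{1/2}).
    With diagonal C1, C2 the trace term reduces to
    sum((sqrt(var1) - sqrt(var2))^2).
    """
    mean_term = np.sum((mu1 - mu2) ** 2)
    cov_term = np.sum(var1 + var2 - 2.0 * np.sqrt(var1 * var2))
    return mean_term + cov_term
```

Identical distributions give a distance of zero, and the score grows with both mean shifts and variance mismatches between the two feature sets.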


Speech recognition using python

#artificialintelligence

Speech recognition is the ability of a machine or program to identify words and phrases in spoken language and convert them to text. You have probably seen it in sci-fi, and in personal assistants like Siri, Cortana, and Google Assistant, and other virtual assistants that you interact with through voice. To understand your voice, these AI assistants need to perform speech recognition on what you have just said. Speech recognition is a complex process; I'm not going to teach you how to train a machine learning or deep learning model for it. Instead, I will show you how to do it using the Google speech recognition API. As long as you know the basics of Python, you can complete this tutorial and build your own fully functioning speech recognition programs in Python.
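A typical way to do this in Python is through the third-party `SpeechRecognition` package (`pip install SpeechRecognition`), whose `recognize_google` method sends audio to Google's free web speech API. A minimal sketch, assuming a 16-bit WAV file; the file name is a placeholder:

```python
def transcribe(wav_path: str) -> str:
    """Transcribe a WAV file via the Google web speech API."""
    # Third-party dependency: pip install SpeechRecognition
    import speech_recognition as sr

    recognizer = sr.Recognizer()
    with sr.AudioFile(wav_path) as source:
        audio = recognizer.record(source)  # read the entire file
    # Sends the audio over the network to Google's recognizer
    return recognizer.recognize_google(audio)

if __name__ == "__main__":
    print(transcribe("hello.wav"))  # "hello.wav" is a placeholder path
```

Note that `recognize_google` requires an internet connection and raises `sr.UnknownValueError` when the audio cannot be understood, so production code should wrap the call in a try/except.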


Q Acoustics Q Active 200 review: This high-end powered bookshelf audio system delivers impeccable performance

PCWorld

Q Acoustics builds mighty-fine loudspeakers, and for its first self-powered offering, the company could have modified any of its existing designs by bolting on an amplifier and calling it a day. What it has wrought instead is a complete high-end audio system that can accommodate nearly any source: analog or digital, wired or wireless, streaming or locally sourced; one that can be incorporated into any of the most common home-audio and smart-home ecosystems. The Q Active 200 system consists of a pair of self-amplified, wireless two-way bookshelf speakers and the Q Active Control Hub (the company will soon offer the same technology in a tower speaker system, the Q Active 400). The audio sources the Hub can handle range from a server on your network, to most of the popular streaming services, to a turntable equipped with a moving-magnet cartridge. It can then send that music both to its own speakers and to other audio systems on your network, using Apple AirPlay 2 or Google Chromecast.


OtoWorld: Towards Learning to Separate by Learning to Move

Ranadive, Omkar, Gasser, Grant, Terpay, David, Seetharaman, Prem

arXiv.org Machine Learning

We present OtoWorld, an interactive environment in which agents must learn to listen in order to solve navigational tasks. The purpose of OtoWorld is to facilitate reinforcement learning research in computer audition, where agents must learn to listen to the world around them to navigate. OtoWorld is built on three open source libraries: OpenAI Gym for environment and agent interaction, PyRoomAcoustics for ray-tracing and acoustics simulation, and nussl for training deep computer audition models. OtoWorld is the audio analogue of GridWorld, a simple navigation game. OtoWorld can be easily extended to more complex environments and games. To solve one episode of OtoWorld, an agent must move towards each sounding source in the auditory scene and "turn it off". The agent receives no other input than the current sound of the room. The sources are placed randomly within the room and can vary in number. The agent receives a reward for turning off a source. We present preliminary results on the ability of agents to win at OtoWorld. OtoWorld is open-source and available.
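The episode structure described above (move toward each sounding source and "turn it off" for a reward, observing only what the agent hears) can be illustrated with a toy 1-D environment. This is a pure-Python analogue for intuition only, not the actual OtoWorld library, which builds on OpenAI Gym, PyRoomAcoustics, and nussl; all names here are ours:

```python
import random

class ToyOtoWorld:
    """Minimal 1-D analogue of an OtoWorld episode (illustrative only).

    The agent observes a single loudness cue that decays with distance
    to the nearest active source; the episode ends when every source
    has been reached and turned off.
    """

    def __init__(self, n_sources=2, size=10, seed=0):
        rng = random.Random(seed)
        self.size = size
        # Sources placed at random positions (never at the start cell)
        self.sources = sorted(rng.sample(range(1, size), n_sources))
        self.pos = 0

    def _observation(self):
        if not self.sources:
            return 0.0  # silence: the room has been cleared
        return 1.0 / (1 + min(abs(self.pos - s) for s in self.sources))

    def step(self, action):
        """action: -1 (move left) or +1 (move right).

        Returns (observation, reward, done)."""
        self.pos = max(0, min(self.size - 1, self.pos + action))
        reward = 0.0
        if self.pos in self.sources:
            self.sources.remove(self.pos)  # "turn it off"
            reward = 1.0
        return self._observation(), reward, not self.sources
```

Even in this toy version, the key property survives: the only signal available for navigation is the sound-derived observation, so an agent must learn to follow loudness gradients to collect rewards.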


Generate video from any given audio source

#artificialintelligence

This paper presents a method to edit a target portrait footage by taking a sequence of audio as input to synthesize a photo-realistic video.